Abstract



When dealing with time series with complex non-stationarities, low retrospective 
regret on individual realizations is a more appropriate goal than low prospective 
risk in expectation. Online learning algorithms provide powerful guarantees of 
this form, and have often been proposed for use with non-stationary processes 
because of their ability to switch between different forecasters or "experts". However,
existing methods assume that the set of experts whose forecasts are to be
combined are all given at the start, which is not plausible when dealing with a 
genuinely historical or evolutionary system. We show how to modify the "fixed 
shares" algorithm for tracking the best expert to cope with a steadily growing set 
of experts, obtained by fitting new models to new data as it becomes available, 
and obtain regret bounds for the growing ensemble. 



1 Introduction 

Non-stationarity is ubiquitous in the study of real time series; macroeconomic statistics, climate
records and gene expression levels are all prominent examples, as are important engineering problems
of signal processing and anomaly detection. Sometimes the non-stationarity is harmless, as
when the data come from a homogeneous Markov process, or more generally from a conditionally
stationary [3] source, since then the best prediction for each historical context is invariant, though
various contexts become more or less common. More generally, however, non-stationary processes
have trends, so the predictive implications of any given historical context change over time.

Time series textbooks (such as ifTTl ) advise turning non-stationary processes into stationary ones, 
by, e.g., subtracting off trends and then analyzing the residuals as a stationary process. If there
are multiple independent replicas of the process, all with the same trend, the latter could be estimated
non-parametrically. If there is only one realization, systematically estimating the trend needs
a well-specified parametric model embracing both trend and fluctuations. Time series from complex
systems, however, typically lack parametric models of trends deserving much credence. In macroeconomics,
for instance, the state-of-the-art models are all for stationary fluctuations, and trends are
identified by ad hoc procedures, most often spline smoothing [6].¹

The fundamental problem is that in many complex, evolving systems, the low-dimensional variables
we happen to measure may develop in basically unpredictable ways. Old patterns may become
completely irrelevant, even actively misleading. We could try to identify change-points and start
modeling afresh at each break, but there is often little a priori reason to think that non-stationarities
will take the form of abrupt breaks, as opposed to more gradual transitions, to say nothing of all of
the difficulties which plague change-point detection.

¹ Under the alias of the "Hodrick-Prescott filter" [14].

While a non-stationary process could evolve in a totally capricious fashion, more often there are 
at least local periods where the predictive relationship between history and future does not change 
too rapidly. For stationary processes, this relationship is fixed and can be learned nonparametrically 
lfT3l [T1. leading to forecasts with low risk, i.e., low expected loss on new data. In contrast, we follow 
the individual-sequence forecasting literature [4] in wanting to have low regret relative to a given
collection of models: no matter what sample path the process realizes, we want to have done
nearly as well as the model, or sequence of models, which in hindsight proves to have forecast best.
It is hard to see how we could go beyond bounds on retrospective regret to bounds on prospective 
risk in the face of arbitrary, unknown non-stationarities; to do so would be tantamount to solving the 
problem of induction. 

We start from algorithms for "prediction with expert advice," which adaptively combine the forecasts
of an ensemble of models or "experts" so as to guarantee low regret. We focus on versions of
the "exponentially-weighted average forecaster" [12, 19] (or "multiplicative weight training" [2]),
which forecasts a weighted combination of the predictions of the experts in its ensemble, with
weights being multiplied up or down as experts do better or worse than the ensemble average. Using
$q$ experts over $n$ rounds, this guarantees a regret, with respect to the retrospectively-best single
expert, of no more than $O(\sqrt{n \ln q})$.

If instead of combining individual experts, we combine sequences drawn from an ensemble of base
experts, we can "track the best expert". Specifically, the regret compared to a sequence where
the base expert is switched at most $m$ times follows the same form, but with the number of such
sequences in place of the number of base experts $q$; some combinatorics [4] gives a bound of

$$O\left(\sqrt{n\left((m+1)\ln q + (n-1)H\left(\frac{m}{n-1}\right)\right)}\right),$$

with $H(p) = -p\ln p - (1-p)\ln(1-p)$ being the binary entropy function, appearing here via
Stirling's formula. The "fixed shares" algorithm introduced by [9] implements this with only $q$
weights, not a combinatorially-large number.

These algorithms are not quite suitable for the problems we have in mind, however, because they 
presume that all experts in the ensemble are present at the start. Low regret relative to such an 
ensemble is not very comforting: none of the experts might be much good, because one is faced with 
conditions very different from any anticipated when the experts were set up. One could allow each
expert to adapt: rather than being a fixed forecasting rule, each expert could estimate some
statistical model from (some part of) the sequence, and then forecast on that basis. This actually
requires no change to the results for, e.g., the fixed-share forecaster (see below), because the
conditions of the theorems put no limit on how the experts' forecasts depend on the past, just that
they do (measurably).

Our proposal therefore is to grow the ensemble, adding a new expert every $\tau$ time-steps. To cope
with non-stationarity, which would mean that old data becomes irrelevant, the expert added at time
$k\tau$ is fitted to the data from $(k-1)\tau + 1$ onward,² and thereafter is free to keep on updating its
parameter estimates and, of course, its predictions, using new data. As the ensemble grows, the
oldest model is always fitted to the complete time series, followed by successively younger models
which omit more and more of the oldest data, until the very youngest model only fits to the last $\tau$
steps or less. There is thus always an expert which is fitted to the whole data stream, and at the other
extreme an expert fitted only to the most recent data. If we can prove a regret bound for this growing
ensemble, we will have something which performs (nearly) as well as a rule which uses the whole
of the data, presumably optimal in the stationary case, as well as performing (nearly) as well as an
expert using only the last $\tau$ observations, as would be suitable in case of a profound change-point or
structural break.

We show how to modify the fixed shares algorithm to work efficiently with such an ensemble, while
still providing an $o(n)$ bound on tracking regret. §2 fixes the setting and notation. §3 introduces the
exponentially-weighted forecaster over expert sequences drawn from a growing ensemble, and our
modification of the fixed-shares forecaster. The major results, the equivalence of the two forecasters
and the regret bound, follow in §§3.4-3.5. §4 presents an empirical example from macroeconomics.
§5 contrasts our approach with previous work, and discusses its methodological significance.

² The very first expert is initialized with some default parameter setting.



2 Setting and Notation 

We follow the usual setting of individual-sequence forecasting [4]. At each discrete time
$t \in \{1, 2, \ldots, n\}$, Nature produces an observation $y_t \in \mathcal{Y}$. Nature may be deterministic, stochastic, or
even a clever and deceitful Adversary. Our forecaster has access to a set of experts (for us, a set
depending on $t$), with the $i$-th expert predicting $f_{i,t} \in \mathcal{D}$. (The "prediction" could be an action,
but for concreteness we will only talk about predictions.) The forecaster also has available the data
$y_1, y_2, \ldots, y_{t-1}$, and combines this, along with the advice of the experts, to give a prediction $p_t \in \mathcal{D}$.
After the forecaster makes its prediction, it learns $y_t$, leading to losses $\ell(f_{i,t}, y_t)$ for the experts and
$\ell(p_t, y_t)$ for the forecaster.

The aim of the forecaster is to have predicted almost as well as the best expert, or even the best
sequence of experts, no matter what Nature does. The tracking regret of the forecaster with respect
to a sequence of experts $i_1, i_2, \ldots, i_n$ is the difference in their cumulative losses:

$$R(i_1, i_2, \ldots, i_n) = \sum_{t=1}^{n} \ell(p_t, y_t) - \sum_{t=1}^{n} \ell(f_{i_t,t}, y_t) \tag{1}$$

Good forecasting strategies have regrets which can be bounded uniformly over both expert sequences
and observation sequences $y_1, \ldots, y_n$. Ideally the bound would be $o(n)$, so that the regret per unit
time goes to zero; in that case the forecaster is "Hannan-consistent."

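Some convenient abbreviations: $y_s^t$ is the sub-sequence of observations $y_s, y_{s+1}, \ldots, y_{t-1}, y_t$, and
likewise for other sequence variables. Further abbreviate $\ell(f_{i_t,t}, y_t)$ by $\ell(i_t, y_t)$, and $\sum_{r=s}^{t} \ell(i_r, y_r)$
by $\ell(i_s^t, y_s^t)$. Regret is then

$$R(i_1^n) = \ell(p_1^n, y_1^n) - \ell(i_1^n, y_1^n) \tag{2}$$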
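The definition of tracking regret can be made concrete with a small sketch. The function names, the 0-based indexing and the choice of squared loss below are illustrative assumptions, not part of the paper's setup; any loss function could be substituted.

```python
from typing import Callable, Sequence


def squared_loss(prediction: float, outcome: float) -> float:
    """Illustrative loss; the text leaves the loss function generic."""
    return (prediction - outcome) ** 2


def tracking_regret(
    forecasts: Sequence[float],
    expert_forecasts: Sequence[Sequence[float]],
    outcomes: Sequence[float],
    expert_path: Sequence[int],
    loss: Callable[[float, float], float] = squared_loss,
) -> float:
    """Cumulative loss of the forecaster minus that of a sequence of experts.

    expert_forecasts[t][i] is expert i's prediction in round t (0-based),
    and expert_path[t] is the expert the comparison sequence uses in round t.
    """
    forecaster_loss = sum(loss(p, y) for p, y in zip(forecasts, outcomes))
    path_loss = sum(
        loss(expert_forecasts[t][i], outcomes[t]) for t, i in enumerate(expert_path)
    )
    return forecaster_loss - path_loss
```

A negative value means the forecaster did better than the chosen comparison sequence; a regret bound limits how large the value can be over all comparison sequences of a given kind.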



2.1 The basic forecasters 



The exponentially weighted average forecaster [12, 19]. Given an ensemble of $q$ experts, initial
(positive) weights $w_{i,0}$, and a learning rate $\eta > 0$, this forecaster predicts by a weighted average³

$$p_t = \frac{\sum_{i=1}^{q} w_{i,t-1} f_{i,t}}{\sum_{j=1}^{q} w_{j,t-1}} \tag{3}$$

and updates the weights by

$$w_{i,t} = w_{i,t-1} e^{-\eta \ell(f_{i,t}, y_t)} \tag{4}$$

This can be seen as a version of reinforcement learning, or as Bayes's rule (if $\ell$ is negative
log-likelihood), or as the evolutionary replicator dynamic, with time-dependent fitness function
$e^{-\eta \ell(f_{i,t}, y_t)}$ [removed for anonymous submission].
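A minimal sketch of the exponentially weighted average forecaster may help fix ideas: predict with the normalized weighted average, then multiply each weight by the exponential of minus the learning rate times that expert's loss. The function names and the list-based representation are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def ewaf_predict(weights: Sequence[float], expert_forecasts: Sequence[float]) -> float:
    """Weighted average of the experts' forecasts, using the previous round's weights."""
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, expert_forecasts)) / total


def ewaf_update(
    weights: Sequence[float], expert_losses: Sequence[float], eta: float
) -> List[float]:
    """Multiply each weight by exp(-eta * loss of that expert in this round)."""
    return [w * math.exp(-eta * loss) for w, loss in zip(weights, expert_losses)]
```

In use, one would call `ewaf_predict` with the current weights, observe the outcome, compute each expert's loss, and then call `ewaf_update` before the next round.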



As mentioned above, the regret of the EWAF is $O(\sqrt{n \ln q})$ [4]. If each member of the EWAF's ensemble
is actually a sequence over some class of base experts, we get a forecaster which can keep low
regret even if the best expert to use changes; the cost, however, is keeping around a combinatorially-large
number of weights. The fixed shares forecaster, described next, achieves the same results with
only one weight for each base expert, by modifying the manner in which weights are updated.



The fixed shares forecaster [9]. We have $q$ experts, each with a time-varying weight, and the
forecast is, as before, a convex combination:

$$p_t = \frac{\sum_{i=1}^{q} w_{i,t-1} f_{i,t}}{\sum_{j=1}^{q} w_{j,t-1}} \tag{5}$$

Initially, all weights $w_{i,0}$ are equal. The update equations are

$$w_{i,t} = (1-\alpha) v_{i,t} + \frac{\alpha}{q} \sum_{j=1}^{q} v_{j,t} \tag{6}$$

where

$$v_{i,t} = w_{i,t-1} e^{-\eta \ell(f_{i,t}, y_t)} \tag{7}$$

and $\alpha \in [0, 1]$ is another control setting. In words, weights update almost exactly as in the EWAF,
except that weight is shared so that no expert ever falls below a fraction $\alpha/q$ of the total weight. As
shown in [9], this matches the behavior of exponential weighting over expert sequences, provided
the initial weights of sequences are not all equal; the weights are in fact chosen to depend on the
number of times the sequence changes expert, peaking when the number of switches is about $\alpha n$.

³ If convex combinations over $\mathcal{D}$ do not make sense, we use a randomized forecaster, complicating the
notation a little and requiring us to bound expected regret rather than actual regret. (By Markov's inequality,
low expected regret implies low realized regret with high probability.)
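The fixed shares update can be sketched in a few lines. This assumes the standard form of the update, in which a fraction of the total post-loss weight is pooled and redistributed equally among the experts; the function name is hypothetical.

```python
import math
from typing import List, Sequence


def fixed_share_update(
    weights: Sequence[float],
    expert_losses: Sequence[float],
    eta: float,
    alpha: float,
) -> List[float]:
    """One fixed shares round: exponential loss update, then weight sharing."""
    # Loss update, exactly as in the exponentially weighted forecaster.
    v = [w * math.exp(-eta * loss) for w, loss in zip(weights, expert_losses)]
    # Each expert keeps (1 - alpha) of its own weight and receives an equal
    # share of a pool made up of a fraction alpha of the total weight.
    pool = alpha * sum(v) / len(v)
    return [(1.0 - alpha) * vi + pool for vi in v]
```

Sharing leaves the total weight unchanged and keeps every expert's weight bounded away from zero, which is what lets a previously poor expert recover quickly when conditions change.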



3 Growing Ensemble Forecasters 



3.1 The Growing Ensemble 

We start with a single expert. We divide the time series into "epochs," each of length $\tau$, and add a
new expert at the beginning of each epoch.⁴ When added, the new expert is trained only on the data
in the previous epoch. The number of experts at time $t$ is $q_t = 1 + \lfloor t/\tau \rfloor$. By time $n$, when the
ensemble has $q_n = 1 + \lfloor n/\tau \rfloor$ experts, one is trained over all data from time 1 to $n$, one on data from
$1 + \tau$ to $n$, one on $1 + 2\tau$ to $n$, and so on, down to one trained on the last $n \bmod \tau$ observations.
The hope is that this will let us cope with abrupt structural breaks (within at most $\tau$ time-steps),
gradual drift, and, of course, actual stationarity.
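As an illustration of the schedule just described, the sketch below computes the ensemble size and the first observation each expert is trained on. It assumes the ensemble size is one plus the integer part of the time divided by the epoch length, and 1-based time indices; both helper names are hypothetical.

```python
from typing import List


def ensemble_size(t: int, tau: int) -> int:
    """Number of experts available at time t, assuming one new expert per epoch."""
    return 1 + t // tau


def training_starts(n: int, tau: int) -> List[int]:
    """First (1-based) observation in the training window of each expert at time n.

    Expert k (counting from 0) is trained on observations 1 + k * tau through n.
    """
    return [1 + k * tau for k in range(ensemble_size(n, tau))]
```

With the values used in the GDP example later in the paper (252 observations, epochs of 16 quarters), the youngest expert's window covers the final 12 observations, which is the remainder of 252 divided by 16. When the series length is an exact multiple of the epoch length, the youngest expert's window under this indexing is empty, so an implementation would need to decide how to initialize it.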



3.2 Exponentially-Weighted Averaging over the Growing Ensemble

To obtain a low tracking regret, we wish to run EWAF over sequences of experts from the growing
ensemble, limiting it, of course, to only using experts which are currently available. During the
first $\tau$ time steps, there is only one expert, but either of two experts can be used at any time from
$t = 1 + \tau$ to $t = 2\tau$, any of three experts from $t = 1 + 2\tau$ to $3\tau$, etc. Even limiting ourselves to
sequences which switch experts no more than $m$ times still leaves a combinatorially-large number
of base-expert sequences, though smaller than what would be the case if all $q_n$ final experts were
available from the beginning.

We will write the weight of the expert sequence $i_1^n$ at time $t$ as $\phi_t(i_1^n)$. It is of course

$$\phi_t(i_1^n) = \phi_{t-1}(i_1^n) e^{-\eta \ell(i_t, y_t)} = \phi_0(i_1^n) e^{-\eta \ell(i_1^t, y_1^t)} \tag{8}$$

We may regard $\phi_t$ as a measure on the space of expert sequences of length $n$, which defines measures
on sub-sequences by summation; by a slight abuse of notation we will also write them as $\phi_t$, so
that $\phi_t(i_1^s) = \sum_{i_{s+1}^n} \phi_t(i_1^n)$.

We propose the following scheme of initial weights $\phi_0(i_1^n)$. Its main virtue is that it can be emulated
by a direct modification of the fixed-shares forecaster, described immediately below. We prove the
emulation result as Theorem 1.

$$\frac{\phi_0(i_1^{t+1})}{\phi_0(i_1^t)} = \begin{cases} \beta_t & \text{if } t \bmod \tau = 0 \text{ and } i_{t+1} = q_{t+1} \\ \frac{\alpha}{q_t} + (1-\alpha)\mathbf{1}\{i_{t+1} = i_t\} & \text{otherwise} \end{cases} \tag{9}$$

That is, $\beta_t$ controls the weight assigned to an expert when it enters the ensemble (and has no track
record of losses). The choice of $\beta_t$ is an important issue, to which we return in the conclusion. For
the rest of this paper, however, we set $\beta_t = \alpha/q_t$, so that

$$\frac{\phi_0(i_1^{t+1})}{\phi_0(i_1^t)} = \frac{\alpha}{q_t} + (1-\alpha)\mathbf{1}\{i_{t+1} = i_t\} \tag{10}$$

We abbreviate $\mathbf{1}\{i_{t+1} = i_t\}$ by $x_t$, suppressing explicit dependence on the sequence of actions. Setting
the base condition $\phi_0(1) = 1$ (because every sequence must begin with the single expert available
at the start), this recursively defines the initial weights for all sequences of experts:

$$\phi_0(1) = 1 \tag{11}$$
$$\phi_0(i_1^{t+1}) = \phi_0(i_1^t)\left(\frac{\alpha}{q_t} + (1-\alpha)x_t\right) \tag{12}$$

with the restriction $i_t \le q_t$ understood.

⁴ Only trivial changes are required to begin with $q_0$ experts and add $c$ new experts every epoch.

3.3 Growing-ensemble fixed shares forecaster

The number of sequences of length $n$ from the growing ensemble is too large to keep track of weights
for each one, so, following the lead of [9], we introduce a fixed-shares procedure which will turn out
to match the weights induced by Eqs. 8 and 12.

At time $t$, each of the $q_t$ experts has a weight $w_{i,t}$. Initially, $w_{1,0} = 1$. Thereafter, weights update
following the static ensemble procedure almost exactly. For $1 \le i \le q_t$,

$$v_{i,t} = w_{i,t-1} e^{-\eta \ell(f_{i,t}, y_t)} \tag{13}$$

$$w_{i,t} = (1-\alpha) v_{i,t} + \frac{\alpha}{q_t} \sum_{j=1}^{q_t} v_{j,t} \tag{14}$$

and $w_{i,t} = 0$ for $i > q_t$. Prediction, as always, is a convex combination, $p_t = \sum_{i=1}^{q_t} w_{i,t-1} f_{i,t} / \sum_{j=1}^{q_t} w_{j,t-1}$.
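The growing-ensemble procedure can be sketched as a loop that pads the weight vector with zeros whenever a new expert becomes available and then applies the shared update over the experts currently present. This is a sketch under stated assumptions: the ensemble size is taken to be one plus the integer part of the time divided by the epoch length, the shared pool is divided by the current ensemble size, and all function and argument names are hypothetical.

```python
import math
from typing import List, Sequence


def growing_fixed_share_step(
    weights: Sequence[float],
    expert_losses: Sequence[float],
    eta: float,
    alpha: float,
) -> List[float]:
    """One update over the experts currently in the ensemble.

    A newly added expert is expected to arrive with weight 0, so it acquires
    weight only through the shared pool.
    """
    q_t = len(weights)
    v = [w * math.exp(-eta * loss) for w, loss in zip(weights, expert_losses)]
    pool = alpha * sum(v) / q_t
    return [(1.0 - alpha) * vi + pool for vi in v]


def run_growing_ensemble(
    loss_table: Sequence[Sequence[float]], tau: int, eta: float, alpha: float
) -> List[List[float]]:
    """Weights after each round; loss_table[t - 1][i] is expert i's loss in round t.

    Row t of loss_table must contain at least 1 + t // tau entries.
    """
    weights = [1.0]  # a single expert with weight 1 at the start
    history = []
    for t, losses in enumerate(loss_table, start=1):
        q_t = 1 + t // tau
        weights = list(weights) + [0.0] * (q_t - len(weights))
        weights = growing_fixed_share_step(weights, losses[:q_t], eta, alpha)
        history.append(list(weights))
    return history
```

Only one weight per available expert is stored, so the cost per round is linear in the current ensemble size rather than in the number of expert sequences.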

3.4 Equivalence of Fixed Shares and Exponentially-Weighted Sequences

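Following [9], we show that the fixed shares algorithm assigns the same weight to any given base
expert, at any given time, as it gets from the exponentially-weighted averaging forecaster applied to
base-expert sequences. This implies that they have the same behavior, and in particular the same
regret bounds. Our proof is based on that from [4, Theorem 5.1, p. 103].

Theorem 1 Let $\phi_{j,t} = \sum_{i_1^n : i_{t+1} = j} \phi_t(i_1^n)$. If $\phi_0$ is given by Eq. 12 and $w$ updates by Eqs. 13-14,
then for all $j$ and $t$ and $y_1^n$, $\phi_{j,t} = w_{j,t}$.

Comment: Recall that the EWAF will use $\phi_{t-1}$, not $\phi_t$, to make its forecast at time $t$.

Proof: By induction on $t$. When $t = 0$, by construction, $w_{1,0} = 1$, and $w_{j,0} = 0$ for all $j > 1$. But
this is true for $\phi_{j,0}$ as well, by Eq. 11. For the inductive step from $t - 1$ to $t$, assume $w_{j,s} = \phi_{j,s}$ for
all $j$ and for all $s < t$. Write $\phi_{j,t}$ as a "sum over histories", using Eq. 8, with $i_{t+1} = j$ throughout:

$$\phi_{j,t} = \sum_{i_1^t} \phi_t(i_1^t, j) = \sum_{i_1^t} e^{-\eta \ell(i_1^t, y_1^t)} \phi_0(i_1^t, j) = \sum_{i_1^t} e^{-\eta \ell(i_1^t, y_1^t)} \phi_0(i_1^t) \left(\frac{\alpha}{q_t} + (1-\alpha)x_t\right)$$

by Eq. 12. Moving the losses through time $t - 1$ into the weights,

$$\phi_{j,t} = \sum_{i_t} \phi_{i_t,t-1} e^{-\eta \ell(i_t, y_t)} \left(\frac{\alpha}{q_t} + (1-\alpha)x_t\right) = \sum_{i_t} w_{i_t,t-1} e^{-\eta \ell(i_t, y_t)} \left(\frac{\alpha}{q_t} + (1-\alpha)x_t\right)$$

where we replace $\phi_{i_t,t-1}$ with $w_{i_t,t-1}$ by the inductive hypothesis. Finally,

$$\phi_{j,t} = \sum_{i_t} v_{i_t,t} \left(\frac{\alpha}{q_t} + (1-\alpha)x_t\right) = w_{j,t} \tag{15}$$

by Eqs. 13-14. □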
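The equivalence can be checked numerically by brute force on a tiny problem. The sketch below does this for the simpler static-ensemble case (a fixed number of experts with uniform initial weights, as in the result of [4] on which the proof is based), not for the growing ensemble itself; all names are hypothetical and the enumeration is only feasible for very small problems.

```python
import itertools
import math
from typing import List, Sequence, Tuple


def sequence_prior(path: Tuple[int, ...], q: int, alpha: float) -> float:
    """Initial weight of an expert sequence in the static case: uniform first
    expert, then a factor alpha / q + (1 - alpha) * [no switch] per step."""
    weight = 1.0 / q
    for prev, cur in zip(path, path[1:]):
        weight *= alpha / q + (1.0 - alpha) * (1.0 if cur == prev else 0.0)
    return weight


def marginal_sequence_weight(
    j: int, t: int, losses: Sequence[Sequence[float]], q: int, eta: float, alpha: float
) -> float:
    """Total exponential weight at time t of all sequences whose (t+1)-th expert is j."""
    total = 0.0
    for prefix in itertools.product(range(q), repeat=t):
        cumulative_loss = sum(losses[s][i] for s, i in enumerate(prefix))
        total += sequence_prior(prefix + (j,), q, alpha) * math.exp(-eta * cumulative_loss)
    return total


def fixed_share_weights(
    t: int, losses: Sequence[Sequence[float]], q: int, eta: float, alpha: float
) -> List[float]:
    """Fixed shares weights after t rounds, starting from uniform weights."""
    w = [1.0 / q] * q
    for s in range(t):
        v = [w[i] * math.exp(-eta * losses[s][i]) for i in range(q)]
        pool = alpha * sum(v) / q
        w = [(1.0 - alpha) * vi + pool for vi in v]
    return w
```

Comparing `marginal_sequence_weight(j, t, ...)` with entry `j` of `fixed_share_weights(t, ...)` for every expert and round should give agreement up to floating-point error, which is the content of the equivalence in the static case.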
3.5 Regret bounds for the modified forecasters

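The size $\sigma(i_1^n)$ of a sequence of experts is the number of times the expert used changes, $\sigma(i_1^n) = \sum_{t=1}^{n-1} \mathbf{1}\{i_{t+1} \neq i_t\}$.
We require this to be $\le m$. Let $m_k$ be the number of switches within the $k$-th epoch, i.e., $m_k = \sigma(i_1^{k\tau}) - \sigma(i_1^{(k-1)\tau})$,
so $\sum_k m_k = m$.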

Theorem 2 For all $n \ge 1$ and $y_1^n$, the tracking regret of the growing ensemble fixed shares forecaster
is at most

$$R(i_1^n) \le \frac{m}{\eta} \ln q_n - \frac{1}{\eta} \ln \alpha^m (1-\alpha)^{n-m} + \frac{\eta}{8} n \tag{16}$$

for all expert sequences $i_1^n$ where $m = \sigma(i_1^n)$.

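Proof (after [4, Theorem 5.2]): The key observation, proved as e.g. Lemma 5.1 in [4], is a general
bound for exponentially-weighted forecasters with unequal initial weights, which relates their loss
to the sum of the weights:

$$\ell(p_1^n, y_1^n) \le -\frac{1}{\eta} \ln \sum_{j_1^n} \phi_n(j_1^n) + \frac{\eta}{8} n \tag{17}$$

Since weights are non-negative and $\ln$ is an increasing function, this implies

$$\ell(p_1^n, y_1^n) \le -\frac{1}{\eta} \ln \phi_n(i_1^n) + \frac{\eta}{8} n \tag{18}$$

for any action sequence $i_1^n$. By the construction of the exponentially weighted forecaster (Eq. 8),

$$\ln \phi_n(i_1^n) = \ln \phi_0(i_1^n) - \eta \ell(i_1^n, y_1^n) \tag{19}$$

Assuming $\sigma(i_1^n) \le m$, the initial log weight is bounded by construction: by Eq. 12, each switch
contributes a factor of at least $\alpha/q_n$ and each remaining step a factor of at least $1 - \alpha$, so

$$-\ln \phi_0(i_1^n) \le m \ln q_n - \ln \alpha^m (1-\alpha)^{n-m} \tag{20}$$

Substituting Eq. 20 into Eq. 19, and the latter into Eq. 18, we get that

$$\ell(p_1^n, y_1^n) \le \ell(i_1^n, y_1^n) + \frac{m}{\eta} \ln q_n - \frac{1}{\eta} \ln \alpha^m (1-\alpha)^{n-m} + \frac{\eta}{8} n \tag{21}$$

and the theorem follows. □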

The familiar regret bound for the exponentially-weighted forecaster with equal initial weights is that
$R$ is at most $O(\sqrt{n \ln N})$, with $N$ being the size of the ensemble. Since the number of allowable
expert sequences of length $n$ with $m$ switches is at most $(q_n)^m \binom{n-1}{m}$, we would be doing well to
achieve a regret bound of $O\left(\sqrt{n\left(\log\binom{n-1}{m} + m \ln q_n\right)}\right)$. This can in fact be done by tuning $\alpha$ and $\eta$.

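Corollary 1 Fix $n$ and $m$, and run the modified fixed share forecaster with $\alpha = \frac{m}{n-1}$ and

$$\eta = \sqrt{\frac{8}{n}\left((n-1)H(\alpha) - \ln(1-\alpha) + m \ln q_n\right)}, \tag{22}$$

then

$$R(i_1^n) \le \sqrt{\frac{n}{2}\left((n-1)H(\alpha) - \ln(1-\alpha) + m \ln q_n\right)} \tag{23}$$

for any action sequence making at most $m$ switches.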
Proof (after [4, Cor. 5.1, p. 105]): Let $\alpha = \frac{m}{n-1}$. Then:

$$\begin{aligned}
\ln \frac{1}{\alpha^m (1-\alpha)^{n-m}} &= -m \ln \alpha - (n - m) \ln(1-\alpha) \\
&= -m \ln \alpha - (n - m - 1) \ln(1-\alpha) - \ln(1-\alpha) \\
&= (n-1)H(\alpha) - \ln(1-\alpha).
\end{aligned}$$

Substituting $\eta$ into the regret bound, and using this equality, we are done. □

Figure 1: (a) Quarterly growth rate of US GDP, 1948-2010, with predictions of the growing ensemble
of AR(12) models and the weighted ensemble forecast. (b) Accumulated regret of the ensemble,
compared to combining global spline smoothing with an ARMA(8,7) model.

Remark 1: Notice that for fixed $m$, $\alpha \to 0$ as $n \to \infty$, so that $H(\alpha) \to 0$ and the over-all bound is
$o(n)$.
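To make the rate in Remark 1 explicit, the following worked step bounds the entropy term. It assumes the minimum share is $m/(n-1)$, which is what the identity used in the corollary's proof requires, and uses the fact that the ensemble size grows at most linearly in $n$. It is a sketch of one way to see the rate, not a sharpening of the stated bound.

```latex
\begin{align*}
(n-1)H(\alpha) &= m\ln\frac{n-1}{m} - (n-1-m)\ln\left(1 - \frac{m}{n-1}\right) \\
  &\le m\ln\frac{n-1}{m} + m,
  \qquad \text{since } -\ln(1-x) \le \frac{x}{1-x} \text{ for } 0 \le x < 1, \\
R(i_1^n) &\le \sqrt{\frac{n}{2}\left(m\ln\frac{e(n-1)}{m} - \ln(1-\alpha) + m\ln q_n\right)}
  = O\left(\sqrt{n\, m \ln n}\right) = o(n) \quad \text{for fixed } m \ge 1.
\end{align*}
```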

Remark 2: The learning rate and minimum share could probably be tuned better by more careful
counting of the number of size-$m$ sequences from the growing ensemble, but since this will only
improve the comparatively-small $m \ln q_n$ term, we omit the combinatorics here.



4 Example: GDP Forecasting 

We illustrate our approach by predicting a non-stationary time series of great practical importance,
the gross domestic product (GDP) of the United States, recorded quarterly from the second quarter
of 1947 to the first quarter of 2010 (from the FRED data service of the Federal Reserve Bank of St.
Louis). After following the common practice of converting this to quarterly growth rates, this gives
$n = 252$ observations. Somewhat arbitrarily, we made all of our models linear autoregressions of
order 12 (i.e., AR(12) models), set the epoch length $\tau$ to 16 quarters or 4 years, and allowed $m = 15$
switches of expert, with $\alpha$ and $\eta$ then following from Corollary 1.
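As a rough illustration of how the two tuning constants follow from the corollary, the sketch below computes the minimum share, the learning rate and the resulting regret bound from the series length, the number of allowed switches and the epoch length. It assumes the minimum share is the number of switches divided by one less than the series length, that the final ensemble size is one plus the integer part of the series length divided by the epoch length, and that the number of switches is strictly between 0 and one less than the series length. The function names are hypothetical and this is not the authors' code.

```python
import math
from typing import Tuple


def binary_entropy(p: float) -> float:
    """Binary entropy in nats, with the convention that it is 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def tune_fixed_share(n: int, m: int, tau: int) -> Tuple[float, float, float]:
    """Return (alpha, eta, regret_bound) for n rounds, m switches, epoch length tau."""
    alpha = m / (n - 1)
    q_n = 1 + n // tau
    complexity = (n - 1) * binary_entropy(alpha) - math.log(1.0 - alpha) + m * math.log(q_n)
    eta = math.sqrt(8.0 * complexity / n)
    regret_bound = math.sqrt(n * complexity / 2.0)
    return alpha, eta, regret_bound


if __name__ == "__main__":
    # Settings quoted in the text: 252 observations, 15 switches, 16-quarter epochs.
    print(tune_fixed_share(n=252, m=15, tau=16))
```

The learning rate chosen this way balances the two parts of the bound in Theorem 2, the part that shrinks as the learning rate grows and the part that grows linearly with it, which is why both contribute equally to the final bound.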

Figure 1a shows the evolution of GDP growth (clearly non-stationary), as well as the evolution of
the ensemble and its weighted-average prediction, which does quite well despite the fact that AR
models are both the simplest possible predictors here, and are all assuredly mis-specified.⁵ State-of-the-art
economic forecasts rely on complicated multivariate state-space models called DSGEs [6],
after de-trending with a smoothing spline. For GDP, however, the predictions of DSGEs are close to
those of a simple autoregressive moving average (ARMA) model, so we fit one to spline residuals;
AIC order selection [17] gave us an ARMA(8,7). Figure 1b shows the accumulated loss of the
ensemble compared to this model; it is both small and growing sub-linearly, despite the fact that the
ARMA model has much more memory than the ARs (because of the moving-average component),
and it takes advantage of the flexibility of non-parametric (and indeed non-causal) smoothing in the
spline. Calculating regret against the best sequence of models from the ensemble, allowing $m = n$,
produced a similar profile over time (not shown), but even smaller comparative losses.



⁵ They are confidently rejected by Box-Ljung tests [17].
5 Discussion



5.1 Related Work 

The closest approach to our method is that of Hazan and Seshadhri [7], who also work within the
family of variants on multiplicative weight training. They introduce a new expert at each time step,
whose initial weight is a fixed function of time, and do not otherwise implement a "fixed share" of
weights, i.e., a minimum weight for each expert. Maintaining such a fixed share is extremely useful
when a pre-existing model becomes one of the best, drastically cutting the time needed for it to
dominate the ensemble. [7] also does not use tracking regret, but rather the maximum regret against
any single expert attained over any contiguous time interval. This time-uniform regret is attractive,
and they prove bounds on it, but only by assuming that each individual expert itself has a low,
time-uniform regret (in the ordinary sense); some of their results even require low losses, not just low
regrets. Our approach, by contrast, is able to accommodate the much more realistic situation where
each individual expert may indeed have high loss, or even high regret, because the process is hard
to predict and no one model is uniformly applicable.

Turning to more conventional approaches, econometrics has a large literature on detecting
non-stationarity (of the basically-harmless "integrated" type characteristic of random walks), and finding
"structural breaks" (change points), after which models must be re-estimated or re-specified [5].
Economists do not seem to have considered an ensemble method like ours, perhaps due to their laudable
(if unfulfilled) ambition to capture the exact data-generating process in a single parsimonious
model. Similarly, most work on data-set shift and concept drift in machine learning [15] deals with
how a single model should be learned (or modified) so as to be robust to various changes in the joint
distribution of inputs and outputs. Unlike all these approaches, we do not have to assume that any of
our models are well-specified, nor assume anything about the nature of the data-generating process
or how it changes over time.

There are some ensemble methods which are reminiscent of aspects of our proposal, such as Kolter 
and Maloof's "additive expert ensemble" algorithm AddExp ifTTIl . the incremental-learning SEA 
algorithm ifTSl . and adaptive time windows algorithms (e.g. iTF]). None of these allow the full com- 
bination of a growing ensemble with temporally-specialized experts and adaptive weights. Con- 
sequently, while some of them can handle mild non-stationarities if the base models are close to 
well-specified, none of them are able to make strong individual-sequence prediction guarantees like 
those of Theorem 2.



5.2 Conclusion 

We have introduced the growing-ensemble method, and shown that it leads to a modification of the
conventional fixed-shares forecaster which is still Hannan-consistent, with $o(n)$ regret over $n$ time-steps
compared to the retrospectively-best sequence of experts. This bound takes into account the
fact that the ensemble grows continually and that individual experts can be arbitrarily bad, while the
time series can have arbitrary non-stationarities. While there are several interesting technical directions
in which to take this (Can the counting of experts be replaced with variation of losses across the
ensemble, as in [8]? Would it help to vary the weight with which new experts get introduced? Is
there an optimal epoch length $\tau$?), the real importance of this work is methodological.

Complex systems tend to produce time series which are not just non-stationary but genuinely evolutionary:
even if there is, in some sense, a fixed high-dimensional generative model, the dynamics
of the low-dimensional variables we deal with change in character over time. Tractable prediction
models for such time series are at best local and transient approximations, no single one of which
will work well for long. It is implausible even to come up with a fixed collection of models before
we see how the system actually develops. Our growing ensemble method accommodates arbitrary
dynamics, without assuming well-specified models, trends that can be extrapolated, stationary behavior
punctuated by well-defined structural breaks, or other such props supporting previous work.
Giving up the desire for the One True Model, of minimal risk, in favor of a growing ensemble of
imperfect models, means we adapt automatically to arbitrary, historically evolving non-stationarities,
including stationarity.